Review: Tools and Programming
Lecture 1 - Basics
Working directory
First check your current working directory with getwd()
To set a new working directory, use the following command: setwd("/Users/alexander/Documents/Master Kiel/Tools and Programming Languages")
Simple mathematical operations
[1] 4
[1] 2
[1] 2
[1] 12
[1] 100
[1] 10
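The commands that produced the six results above are not preserved in these notes. A minimal sketch of basic arithmetic that would give the same values (the specific operands are assumptions; only the results come from the notes):

```r
# Hypothetical operands chosen to reproduce the results 4, 2, 2, 12, 100, 10
2 + 2      # addition
5 - 3      # subtraction
6 / 3      # division
3 * 4      # multiplication
10^2       # exponentiation
sqrt(100)  # square root
```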
Assignment operator (<-)
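A short sketch of how the assignment operator binds a value to a name. The object names a, b and e are illustrative and not taken from the lecture:

```r
a <- 5          # assign the value 5 to the name a
b <- a * 2      # the right-hand side is evaluated first, then assigned to b
e <- c(a, b)    # combine existing objects and assign the result to a new name
e               # typing the name prints the object
```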
Inspect your objects
View(e) (run only in the RStudio environment)
Inspect your first dataset
df <- iris # alternatively write datasets::iris to make the source explicit
head(df) # First 6 observations (the default of head)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
[1] "data.frame"
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species"
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
[1] "setosa" "versicolor" "virginica"
[1] "list"
$names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[70] 70 71 72 73 74 75
[ reached getOption("max.print") -- omitted 75 entries ]
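The outputs above appear to come from the usual inspection functions, but the calls themselves are missing from the notes. A sketch of calls that would produce them, assuming df holds the iris data as defined above (the order of the calls is inferred from the outputs):

```r
df <- datasets::iris
class(df)            # "data.frame"
names(df)            # the five column names
str(df)              # compact overview of the structure
levels(df$Species)   # the three levels of the factor column Species
typeof(df)           # "list": the underlying storage type
attributes(df)       # names, class and row.names
```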
Why df has class "data.frame" and type "list"
‘mode/type’ is a mutually exclusive classification of objects according to their basic structure. The ‘atomic’ modes are numeric, complex, character and logical. Recursive objects have modes such as ‘list’ or ‘function’ or a few others. An object has one and only one mode.
‘class’ is a property assigned to an object that determines how generic functions operate with it. It is not a mutually exclusive classification. If an object has no specific class assigned to it, such as a simple numeric vector, its class is usually the same as its mode, by convention.
Install a package
install.packages("dplyr") (run only in the RStudio environment)
First manipulation of a data frame
Different filter functions
First error and the help function
[1] NA
help(sum) (run only in the RStudio environment) shows the default signature sum(..., na.rm = FALSE). Setting na.rm = TRUE removes missing values before summing:
[1] 36
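The code behind the NA and 36 results is not preserved. A minimal sketch that reproduces both, assuming a hypothetical vector x with one missing value:

```r
x <- c(1:8, NA)        # hypothetical input; the notes only show the results
sum(x)                 # NA, because na.rm = FALSE by default
sum(x, na.rm = TRUE)   # 36, missing values are dropped before summing
```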
Lecture 2 - Basics
Load a data frame from a package or from the working directory
df <- read.csv("name.csv")
Exploring the data
[1] "data.frame"
[1] "list"
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
[1] 32
[1] 11
[1] 32 11
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
[1] 32
[1] 11
[1] 32 11
mpg cyl disp hp drat wt qsec vs am gear carb
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
$names
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
$row.names
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
$class
[1] "data.frame"
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
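The exploration calls are missing from the notes. Judging from the outputs, the data set is mtcars; a sketch of calls that would produce comparable output (the exact sequence is an assumption):

```r
df <- datasets::mtcars
class(df)        # "data.frame"
typeof(df)       # "list"
str(df)          # 32 obs. of 11 variables
nrow(df)         # number of rows
ncol(df)         # number of columns
dim(df)          # both at once: 32 11
head(df)         # first 6 rows by default
tail(df, 5)      # last 5 rows
head(df, 5)      # first 5 rows
attributes(df)   # names, row.names and class
names(df)        # column names only
```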
Subsetting the data
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
mpg cyl disp
Mazda RX4 21.0 6 160.0
Mazda RX4 Wag 21.0 6 160.0
Datsun 710 22.8 4 108.0
Hornet 4 Drive 21.4 6 258.0
Hornet Sportabout 18.7 8 360.0
Valiant 18.1 6 225.0
Duster 360 14.3 8 360.0
Merc 240D 24.4 4 146.7
Merc 230 22.8 4 140.8
Merc 280 19.2 6 167.6
Merc 280C 17.8 6 167.6
Merc 450SE 16.4 8 275.8
Merc 450SL 17.3 8 275.8
Merc 450SLC 15.2 8 275.8
Cadillac Fleetwood 10.4 8 472.0
Lincoln Continental 10.4 8 460.0
Chrysler Imperial 14.7 8 440.0
Fiat 128 32.4 4 78.7
Honda Civic 30.4 4 75.7
Toyota Corolla 33.9 4 71.1
Toyota Corona 21.5 4 120.1
Dodge Challenger 15.5 8 318.0
AMC Javelin 15.2 8 304.0
Camaro Z28 13.3 8 350.0
Pontiac Firebird 19.2 8 400.0
[ reached 'max' / getOption("max.print") -- omitted 7 rows ]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
Hornet 4 Drive 21.4 110
Hornet Sportabout 18.7 175
Valiant 18.1 105
Duster 360 14.3 245
Merc 240D 24.4 62
Merc 230 22.8 95
Merc 280 19.2 123
Merc 280C 17.8 123
Merc 450SE 16.4 180
Merc 450SL 17.3 180
Merc 450SLC 15.2 180
Cadillac Fleetwood 10.4 205
Lincoln Continental 10.4 215
Chrysler Imperial 14.7 230
Fiat 128 32.4 66
Honda Civic 30.4 52
Toyota Corolla 33.9 65
Toyota Corona 21.5 97
Dodge Challenger 15.5 150
AMC Javelin 15.2 150
Camaro Z28 13.3 245
Pontiac Firebird 19.2 175
Fiat X1-9 27.3 66
Porsche 914-2 26.0 91
Lotus Europa 30.4 113
Ford Pantera L 15.8 264
Ferrari Dino 19.7 175
Maserati Bora 15.0 335
Volvo 142E 21.4 109
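The subsetting calls themselves are not shown. A sketch that would yield the five outputs above, assuming df is mtcars (the calls are inferred from the printed results):

```r
df <- datasets::mtcars
df$mpg                    # one column as a numeric vector
df[1:5, ]                 # rows 1 to 5, all columns
df[c(1, 5), ]             # rows 1 and 5 only
df[, 1:3]                 # columns 1 to 3, selected by position
df[, c("mpg", "hp")]      # columns selected by name
```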
Lecture 3 - Basics
Data types
Vectors constitute the most important family of data types in R. There are two fundamentally different types of vectors, atomic vectors and lists. Atomic vectors have homogeneous element types, i.e. they cannot mix numbers, characters and logical values, whereas lists can have heterogeneous element types.
Atomic vectors
char_scalar <- "data" # character vector of length 1
int_scalar <- 1L # integer vector of length 1
dbl_scalar <- 2.1 # double vector of length 1
lgl_scalar <- TRUE # logical vector of length 1
char_scalar
[1] "data"
[1] 1
[1] 2.1
[1] TRUE
# The c() function combines multiple smaller vectors into one larger vector
char <- c("data", "science", "rocks")
int <- c(1L, 10L, 100L)
dbl <- c(2.1, pi, 100)
lgl <- c(TRUE, FALSE, FALSE)
char
[1] "data" "science" "rocks"
[1] 1 10 100
[1] 2.100000 3.141593 100.000000
[1] TRUE FALSE FALSE
Task 1: Confirm the internal storage type of the atomic vectors char, int, dbl, lgl using the function typeof. Use the class function to check their classes.
[1] "character"
[1] "character"
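Only the result for char is shown above. One possible way to check all four vectors at once (the helper names vectors, types and classes are illustrative):

```r
char <- c("data", "science", "rocks")
int  <- c(1L, 10L, 100L)
dbl  <- c(2.1, pi, 100)
lgl  <- c(TRUE, FALSE, FALSE)

vectors <- list(char = char, int = int, dbl = dbl, lgl = lgl)
types   <- sapply(vectors, typeof)  # "character" "integer" "double"  "logical"
classes <- sapply(vectors, class)   # "character" "integer" "numeric" "logical"
```

Note that typeof and class differ only for dbl: the storage type is "double", while the class is "numeric".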
Type coercion
What happens if we combine different types in one vector? R automatically coerces to the more flexible type
[1] "data" "1" "10" "100"
[1] "character"
Not everything is coerced as you might wish: functions may require specific input types:
[1] 2
[1] 2
Error in sum(char): invalid 'type' (character) of argument
Task 2: Rank the 4 common atomic vector types from most to least flexible. To answer this question, create different combinations of the atomic vectors int, dbl, lgl, and char using the c() function and evaluate the type of the combined vector via typeof().
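A sketch of how the ranking can be worked out by combining the vectors pairwise. The resulting order, from most to least flexible, is character, double, integer, logical:

```r
char <- c("data", "science", "rocks")
int  <- c(1L, 10L, 100L)
dbl  <- c(2.1, pi, 100)
lgl  <- c(TRUE, FALSE, FALSE)

typeof(c(lgl, int))    # "integer":   logical gives way to integer
typeof(c(int, dbl))    # "double":    integer gives way to double
typeof(c(dbl, char))   # "character": double gives way to character
typeof(c(lgl, char))   # "character": character wins against everything
mixed <- c(TRUE, 1L, 2.5, "a")  # all four types end up as strings
```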
Factor, date, and datetime objects
There are several other vector classes built on top of atomic vectors. The most important ones are factor, Date and the datetime class POSIXct. They are stored as atomic vectors with attached attributes. They all have special properties that are helpful in practice. For instance, the Date class enables us to calculate with dates, sort dates, and print dates in a readable way.
Factors
Factors are built on top of integer vectors. They are stored as integers with attached labels. Factors are often useful for representing ordered or unordered categorical data.
Let’s turn a character vector of clothing sizes sold in a shop into a factor:
sizes_char <- c("XXL","S","M","S","L","M", "S", "XXL") # character vector
order <- c("S", "M", "L", "XXL") # define how clothing sizes are ordered
sizes_factor <- factor(sizes_char, levels=order) # create a factor whose levels follow the specified order
sizes_factor
[1] XXL S M S L M S XXL
Levels: S M L XXL
Task 3: Let’s try to understand better what a factor is. Apply the functions class, typeof, str, and attributes to sizes_factor and observe the output.
[1] "factor"
[1] "integer"
Factor w/ 4 levels "S","M","L","XXL": 4 1 2 1 3 2 1 4
$levels
[1] "S" "M" "L" "XXL"
$class
[1] "factor"
- class(): class of the object
- typeof(): storage mode of the object
- attributes(): list of the object’s attributes
- str(): compact display of the object’s internal structure
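To make the "integers with attached labels" idea concrete, here is a small sketch that separates the integer codes from the labels and puts them back together (the names codes and labels are illustrative):

```r
sizes_char   <- c("XXL", "S", "M", "S", "L", "M", "S", "XXL")
sizes_factor <- factor(sizes_char, levels = c("S", "M", "L", "XXL"))

codes  <- as.integer(sizes_factor)  # 4 1 2 1 3 2 1 4: positions in the levels
labels <- levels(sizes_factor)      # "S" "M" "L" "XXL"
labels[codes]                       # indexing the labels by the codes restores the strings
```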
Task 4: To see why factor variables are sometimes useful, compare the output of the summary function for the vectors sizes_char and sizes_factor. Which output is more useful?
Length Class Mode
8 character character
S M L XXL
3 2 1 2
Date
The Date class represents dates as the number of days since 1970-01-01 and internally stores them as a double vector. This enables us to sort dates and calculate with dates: add, subtract, create date sequences, etc.
today <- Sys.Date()
tomorrow <- today + 1
year_seq <- seq(today, length.out=5, by="1 year")
today
tomorrow
year_seq
[1] "2019-12-13"
[1] "2019-12-14"
[1] "2019-12-13" "2020-12-13" "2021-12-13" "2022-12-13" "2023-12-13"
Let’s confirm that 1970-01-01 is day 0 for the R Date class
[1] 0
How many days have passed since day 0?
Time difference of 18243 days
Task 5: How are dates prior to 1970-01-01 stored? Let’s try 1969-12-31?
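The calls for the day-0 check are not preserved. A sketch of how the internal number behind a Date can be inspected, including a date before 1970 (the name day_zero is illustrative):

```r
day_zero <- as.Date("1970-01-01")
as.numeric(day_zero)                # 0: the reference date
as.numeric(as.Date("1969-12-31"))   # -1: earlier dates are stored as negative numbers
Sys.Date() - day_zero               # days passed since day 0, printed as a time difference
```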
Datetime (POSIXct)
The POSIXct class represents datetimes as the number of seconds since 1970-01-01 00:00:00 UTC and internally stores them as a double vector. This enables us to sort datetimes and calculate with datetimes: add, subtract, create datetime sequences, etc.
current_time <- Sys.time()
hour_seq <- seq(current_time, length.out=5, by="10 day")
current_time
hour_seq
[1] "2019-12-13 13:41:51 CET"
[1] "2019-12-13 13:41:51 CET" "2019-12-23 13:41:51 CET"
[3] "2020-01-02 13:41:51 CET" "2020-01-12 13:41:51 CET"
[5] "2020-01-22 13:41:51 CET"
Task 6: Check above which time zone is displayed. Is it CET (Central European Time), GMT/UTC (Greenwich Mean Time), or some other time zone? If you want to change your time zone setting, consult help(timezones).
Matrices and arrays
All atomic vectors can be turned into a matrix (2-dimensional) or an array (multi-dimensional) via a dimension attribute. Internally, matrices and arrays are stored as atomic vectors, but R treats them differently. If you apply functions to atomic vectors, matrices and arrays, different things will happen.
Let’s create an integer vector and convert it into a matrix and an array.
Let’s check how matrices and arrays are printed
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 3 5 7 9 11
[2,] 2 4 6 8 10 12
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
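The creation code is missing from the notes. A sketch that reproduces the printed matrix and array, assuming int now holds the integers 1 to 12 (the object names int, mat and arr match the sapply call used further down):

```r
int <- 1:12
mat <- matrix(int, nrow = 2)           # 2 x 6 matrix, filled column by column
arr <- array(int, dim = c(2, 3, 2))    # two 2 x 3 slices
mat
arr
```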
Let’s analyse the nature of matrices and arrays as compared to the integer vector using the functions typeof, class and dim
# The sapply function applies a function (e.g. typeof) to each element of the provided list of objects.
sapply(X=list(int,mat,arr), FUN=class) # Class
[1] "integer" "matrix" "array"
[1] "integer" "integer" "integer"
[[1]]
NULL
[[2]]
[1] 2 6
[[3]]
[1] 2 3 2
Task 7: Above we turned a vector into a matrix and an array using the dedicated functions matrix and array. But there’s another way to achieve this: by setting the dimension attribute (dim). Try several 2- or 3-dimensional combinations. Observe how the printing and the class change.
alphabet <- letters[1:20] # Creates char vector of first 20 letters
dim(alphabet) <- c(4,5) # Play around with c(dim1, dim2, [dim3])
alphabet # Print the object
[,1] [,2] [,3] [,4] [,5]
[1,] "a" "e" "i" "m" "q"
[2,] "b" "f" "j" "n" "r"
[3,] "c" "g" "k" "o" "s"
[4,] "d" "h" "l" "p" "t"
[1] "matrix"
List
Lists are 1-dimensional objects, just like atomic vectors. Unlike atomic vectors, however, list elements can be heterogeneous. A list can combine vectors with data frames, arrays and any other R object (functions, formulas, etc.).
Let’s create a simple list containing 3 different vector classes (integer, character, Date) and inspect how the list is printed.
simple_list <- list(numbers=1:10,
                    letters=LETTERS[1:10],
                    dates=seq(Sys.Date(), length.out=10, by="1 day")
)
simple_list
$numbers
[1] 1 2 3 4 5 6 7 8 9 10
$letters
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
$dates
[1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
[6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"
Lists are sometimes called recursive vectors, because lists can be nested within lists to any depth. As an example, consider the (simplified) representation of the order of Primates.
primates <- list(
  name = "Primates",
  Lemures = list(name = "Lemures"),
  Hominoidea = list(
    name = "Hominoidea",
    Orangutans = list(name = "Orang-Utans"),
    Homininae = list(
      name = "Homininae",
      Gorillas = list(name = "Gorillas"),
      Hominini = list(
        name = "Hominini",
        Chimpanzees = list(name = "Chimpanzees"),
        Humans = list(name = "Humans")))))
Task 8: To see why nested lists can be useful, install the data.tree package and run the code chunk below for a plot of the greatly simplified phylogenetic tree of primates.
Data frame
Data frames are fundamental to data analysis and machine learning. Data frames are 2-dimensional like matrices, but can combine heterogeneous data types across columns. In terms of structure, a data frame is essentially a list of equal-length vectors with attributes for the column names (names), row names (row.names) and its class (data.frame).
Since the simple list from above consists of equal length vectors we can convert this list into a data frame:
numbers letters dates
1 1 A 2019-12-13
2 2 B 2019-12-14
3 3 C 2019-12-15
4 4 D 2019-12-16
5 5 E 2019-12-17
6 6 F 2019-12-18
7 7 G 2019-12-19
8 8 H 2019-12-20
9 9 I 2019-12-21
10 10 J 2019-12-22
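The conversion call is not shown in the notes. A minimal sketch, assuming as.data.frame was used and fixing the start date to 2019-12-13 so the result matches the printout:

```r
simple_list <- list(numbers = 1:10,
                    letters = LETTERS[1:10],
                    dates   = seq(as.Date("2019-12-13"), length.out = 10, by = "1 day"))
df <- as.data.frame(simple_list)  # each list element becomes one column
df
```

In R versions before 4.0.0 the character column is converted to a factor by default (stringsAsFactors = TRUE), which is what the str output further down shows. From R 4.0.0 on it stays a character column.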
Task 9: To better understand the nature of data frames apply the functions class, typeof, attributes and str on the data frame df.
[1] "data.frame"
[1] "list"
$names
[1] "numbers" "letters" "dates"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
'data.frame': 10 obs. of 3 variables:
$ numbers: int 1 2 3 4 5 6 7 8 9 10
$ letters: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
$ dates : Date, format: "2019-12-13" "2019-12-14" ...
Task 10: Attributes can be changed. First, change the column names by assigning c("Zahlen", "Buchstaben", "Daten") to the names attribute. Second, check whether changing the class attribute to “list” actually suffices to create a list.
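library(dplyr) # provides %>% and select()
# Alternative with dplyr: rename the columns while selecting them
df2 <- df %>%
  select(Zahlen = numbers, Buchstaben = letters, Daten = dates)
# Base R: assign the new column names to the names attribute
names(df) <- c("Zahlen", "Buchstaben", "Daten")
Generic functions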
An important family of R functions is called S3 generic functions. Examples of generic functions are summary, print, plot, and mean. Generic functions interact with the class attribute of the function’s first argument in a special way. Depending on the class, generic functions will do different things.
Let’s apply the generic function summary to date, factor, and character vectors.
Min. 1st Qu. Median Mean 3rd Qu.
"2019-12-13" "2020-12-13" "2021-12-13" "2021-12-12" "2022-12-13"
Max.
"2023-12-13"
S M L XXL
3 2 1 2
Length Class Mode
8 character character
Notice that the summary function performs three different summary operations for the three different classes of objects. Let’s focus on one of the objects, the vector sizes_factor, and illustrate what the summary function is doing under the hood:
- R checks the class of the vector sizes_factor.
- R checks whether there is a dedicated summary method for the factor class (summary.factor).
- If yes, the summary.factor method is applied. If no, the summary.default method is applied.
[1] "factor"
[1] TRUE
S M L XXL
3 2 1 2
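The three outputs above correspond to the three dispatch steps. The original calls are not preserved; a sketch that walks through the steps by hand could look like this:

```r
sizes_factor <- factor(c("XXL", "S", "M", "S", "L", "M", "S", "XXL"),
                       levels = c("S", "M", "L", "XXL"))

class(sizes_factor)                     # step 1: "factor"
exists("summary.factor")                # step 2: TRUE, a dedicated method exists
by_hand  <- summary.factor(sizes_factor)  # step 3: call the method directly
by_magic <- summary(sizes_factor)         # what the generic does for us
```

Calling the method directly is only for illustration. In regular code the generic summary should be used so that R picks the method.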
Task 11: Use the methods command to get a list of all methods that can be invoked by the generic functions summary and mean.
[1] mean.Date mean.default mean.difftime mean.POSIXct
[5] mean.POSIXlt mean.quosure* mean.vctrs_vctr*
see '?methods' for accessing help and source code
[1] summary.aov summary.aovlist*
[3] summary.aspell* summary.check_packages_in_dir*
[5] summary.cohesiveBlocks* summary.connection
[7] summary.data.frame summary.Date
[9] summary.default summary.ecdf*
[11] summary.factor summary.gexf*
[13] summary.ggplot* summary.glm
[15] summary.hcl_palettes* summary.igraph*
[17] summary.infl* summary.lm
[19] summary.loess* summary.manova
[21] summary.matrix summary.mlm*
[23] summary.nls* summary.packageStatus*
[25] summary.POSIXct summary.POSIXlt
[27] summary.ppr* summary.prcomp*
[29] summary.princomp* summary.proc_time
[31] summary.rlang_error* summary.rlang_trace*
[33] summary.srcfile summary.srcref
[35] summary.stepfun summary.stl*
[37] summary.table summary.tukeysmooth*
[39] summary.vctrs_sclr* summary.vctrs_vctr*
[41] summary.warnings summary.XMLInternalDocument*
see '?methods' for accessing help and source code
Selecting parts of an object
We often wish to select parts of an object for the purpose of extraction or replacement. There are three main operators to choose from: [, [[ and $.
Multiple elements
The operator [ allows extracting any element or any combination of elements of an R object. Within the square brackets we specify along which dimensions we want to select, writing [dim1, dim2, ...]. In case of matrices and data frames this is [row, col].
In general, the dimension arguments within the square brackets can take three different forms:
- Numeric vector
- Logical vector
- Character vector
Atomic vector
a b c d e
1 2 3 4 5
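The definition of this named vector is missing. One way to create it, assuming it replaces the earlier int vector:

```r
int <- 1:5                    # integer vector 1, 2, 3, 4, 5
names(int) <- letters[1:5]    # attach the names a to e
int
```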
Extracting
int[1] # single position; R uses 1-based indexing
int[2:3] # range of positions
int[c(1,3)] # numeric
int[c(TRUE, FALSE, TRUE, FALSE, FALSE)] # logical
int[c("a", "c")] # character (matched to names)
a
1
b c
2 3
a c
1 3
a c
1 3
a c
1 3
Replacing
Note that the length of the replacement has to match the length of the selection!
a b c d e
100 2 300 4 5
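The replacement call is not shown. A sketch that produces the vector above, with two selected positions receiving two new values:

```r
int <- c(a = 1, b = 2, c = 3, d = 4, e = 5)
int[c(1, 3)] <- c(100, 300)   # two positions selected, two values supplied
int
```

The same replacement can be written with names, int[c("a", "c")] <- c(100, 300), or with a logical vector.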
Matrix
mat <- matrix(1:12, nrow = 3, ncol=4)
rownames(mat) <- paste0("r", 1:3) # matrices can have named dimensions
colnames(mat) <- paste0("c", 1:4)
mat
c1 c2 c3 c4
r1 1 4 7 10
r2 2 5 8 11
r3 3 6 9 12
c3 c4
r1 7 10
r2 8 11
c3 c4
r1 7 10
r2 8 11
c3 c4
r1 7 10
r2 8 11
c1 c2 c3 c4
r1 1 4 7 10
r2 2 5 8 11
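The four outputs above look like the same matrix selection written with the three index forms, followed by a row-only selection. The calls are inferred, not taken from the notes:

```r
mat <- matrix(1:12, nrow = 3, ncol = 4)
rownames(mat) <- paste0("r", 1:3)
colnames(mat) <- paste0("c", 1:4)

by_position <- mat[1:2, 3:4]                                        # numeric
by_logical  <- mat[c(TRUE, TRUE, FALSE), c(FALSE, FALSE, TRUE, TRUE)]  # logical
by_name     <- mat[c("r1", "r2"), c("c3", "c4")]                    # character
first_rows  <- mat[1:2, ]                                           # empty column index keeps all columns
```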
Special cases
Negative selection
We can invert a selection using the - operator for numeric vectors and using the ! operator for logical vectors.
a c
100 300
b d e
2 4 5
a c
100 300
b d e
2 4 5
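A sketch of calls that would give the four outputs above, using the vector after replacement. The logical condition int > 50 is an assumption chosen to select the same elements:

```r
int <- c(a = 100, b = 2, c = 300, d = 4, e = 5)

int[c(1, 3)]     # positive numeric selection: a and c
int[-c(1, 3)]    # negated numeric selection: everything except a and c
keep <- int > 50 # logical vector: TRUE FALSE TRUE FALSE FALSE
int[keep]        # a and c
int[!keep]       # inverted logical selection: b, d and e
```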
Single elements
There are two other important operators: [[ and $. They are useful in the context of lists and data frames (which are internally stored as lists) when we want to select only 1 element, e.g. one column of a data frame. While the [ operator preserves the list structure, [[ and $ enable us to navigate into the list structure.
Let’s consider our simple list from above:
simple_list["dates"] # preserves the list class
simple_list[["dates"]] # extracts a Date vector
simple_list$dates # also extracts a Date vector
$dates
[1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
[6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"
[1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
[6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"
[1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
[6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"
The advantage becomes apparent for nested lists. Unlike [, the operators [[ and $ allow navigating deeply into a nested list and extracting or replacing elements there.
Task 12: Use the $ operator multiple times to navigate through the nested list primates until you reach humans. Don’t type everything by hand. Use autocompletion by pressing tab after each $ operator.
[1] "Humans"
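One possible solution, repeating the primates list from above so the snippet runs on its own. The equivalent chain with [[ is shown for comparison:

```r
primates <- list(
  name = "Primates",
  Lemures = list(name = "Lemures"),
  Hominoidea = list(
    name = "Hominoidea",
    Orangutans = list(name = "Orang-Utans"),
    Homininae = list(
      name = "Homininae",
      Gorillas = list(name = "Gorillas"),
      Hominini = list(
        name = "Hominini",
        Chimpanzees = list(name = "Chimpanzees"),
        Humans = list(name = "Humans")))))

# Each $ steps one level deeper into the nested list
via_dollar <- primates$Hominoidea$Homininae$Hominini$Humans$name
# The same path written with [[
via_brackets <- primates[["Hominoidea"]][["Homininae"]][["Hominini"]][["Humans"]][["name"]]
```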
Given that $ is less verbose and easier to read, why and when should we use [[ instead? Answer: In situations where we want to pass the selection as a variable.
selected_element <- "numbers"
str(simple_list)
simple_list[[selected_element]] # works fine
simple_list$selected_element # returns NULL: $ looks for an element literally named "selected_element"
List of 3
$ numbers: int [1:10] 1 2 3 4 5 6 7 8 9 10
$ letters: chr [1:10] "A" "B" "C" "D" ...
$ dates : Date[1:10], format: "2019-12-13" "2019-12-14" ...
[1] 1 2 3 4 5 6 7 8 9 10
NULL
Application
Data frame
Now, let’s focus on data frames and on some applications of extraction and replacement that we often see in practice. But note that we will cover packages like dplyr or data.table later in class, which make subsetting data frames much more convenient.
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[1] 1 2 3 4 5
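The code for these three outputs is missing, so the following is only one reading that reproduces them. It assumes a hypothetical data frame with an id column running from 1 to 20 and a condition on that column:

```r
df <- data.frame(id = 1:20)   # hypothetical data, not from the lecture
condition <- df$id > 5        # logical vector, one entry per row

condition                     # FALSE for rows 1 to 5, TRUE afterwards
which(condition)              # positions of the TRUE entries: 6 to 20
df$id[!condition]             # values where the condition does not hold: 1 to 5
```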
Task 13: The iris data frame contains information on 150 flowers. First, extract all 12 observations with sepal length larger than 7. Second, for these flowers select only the Species column. Third, change the species names of all flowers to capital letters using the toupper function. Fourth, create an additional variable sepal_length_100 by multiplying Sepal.Length by a factor of 100.
df <- iris
# We explicitly coerce from factor to character, because toupper returns a character vector anyway.
df$Species <- as.character(df$Species)
# First
condition <- df$Sepal.Length > 7
df1 <- df[condition, ]
condition2 <- df1$Petal.Width == 2.1
df1.2 <- df1[condition2,]
df1.2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
103 7.1 3 5.9 2.1 virginica
106 7.6 3 6.6 2.1 virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
103 7.1 3.0 5.9 2.1 virginica
106 7.6 3.0 6.6 2.1 virginica
108 7.3 2.9 6.3 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
123 7.7 2.8 6.7 2.0 virginica
126 7.2 3.2 6.0 1.8 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
136 7.7 3.0 6.1 2.3 virginica
# Second
df2 <- df[condition,"Species"]
# Third
df3 <- toupper(df$Species)
# Fourth
df4 <- df
df4$sepal_length_100 <- df4$Sepal.Length*100
Task 14: In supervised machine learning it is common practice to split the data randomly into a training set (70% of the observations) and a test set (30%). Perform this split for the iris data frame, and use negative selection to create the test data.
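A possible solution sketch for the split. The seed value and the names train_idx, train and test are arbitrary choices, not part of the task:

```r
set.seed(42)                                    # make the random split reproducible
df <- datasets::iris
n  <- nrow(df)

train_idx <- sample(n, size = round(0.7 * n))   # 105 random row positions, no repeats
train <- df[train_idx, ]                        # training set: the sampled rows
test  <- df[-train_idx, ]                       # test set: negative selection of the same positions
```

Because the test set is defined as "everything not in the training set", the two sets cannot overlap and together they cover all 150 rows.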
Loops and their alternatives
Instead of using loops it is often more R-like to use apply (base R) or map (purrr package). However, using a vectorized function like + or *, if available, is the preferred solution because the code is more performant.
As an example, consider the following simple_vector. First, we want to multiply each element of this vector by a factor of 2 using a for loop:
simple_vector <- 1:100
doubled_vector <- simple_vector # First we initialize doubled_vector with the right length
for (i in seq_along(simple_vector)){
  doubled_vector[i] <- simple_vector[i]*2 # Then we iterate over the positions and multiply each element by 2
}
doubled_vector
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
[18] 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68
[35] 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102
[52] 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136
[69] 138 140 142 144 146 148 150
[ reached getOption("max.print") -- omitted 25 entries ]
Now, let’s reformulate this operation using a member of the apply family of functions:
# Here we make use of an anonymous function.
# Since it is anonymous it cannot be used elsewhere
doubled_vector <- sapply(simple_vector, function(x) x*2)
Task: Reformulate this operation using the vectorized * function. This is the most efficient and syntactically simplest way to do it:
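A sketch of the vectorized version. The multiplication is applied to every element at once, with no explicit loop or helper function:

```r
simple_vector  <- 1:100
doubled_vector <- simple_vector * 2   # * operates element-wise on the whole vector
```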
Get data in and out
Task: Use one of the data frames of the datasets package to experiment with importing and exporting, and to practice some of the other aspects of base R covered before. Experiment with subsetting, extract or change columns of the data frame, inspect or change (coerce) the data types of columns, etc. Then write the result to .csv, .RData and .Rds files. Clear the environment and import the data back again.
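One way to approach this task, sketched with mtcars and temporary files so nothing is written to the working directory. The file names and the coercion of cyl to a factor are arbitrary choices for illustration:

```r
df <- datasets::mtcars
df$cyl <- as.factor(df$cyl)            # coerce one column to see what survives the round trip

csv_file   <- tempfile(fileext = ".csv")
rds_file   <- tempfile(fileext = ".Rds")
rdata_file <- tempfile(fileext = ".RData")

write.csv(df, csv_file, row.names = TRUE)  # plain text: column types are not stored
saveRDS(df, rds_file)                      # one object, types preserved
save(df, file = rdata_file)                # one or more objects, stored with their names

rm(df)                                     # remove the object before importing again

from_csv <- read.csv(csv_file, row.names = 1)  # cyl comes back as a number, not a factor
from_rds <- readRDS(rds_file)                  # assign the result to any name
load(rdata_file)                               # restores the object under its old name df
```

The .Rds and .RData formats keep the factor column intact, while the .csv round trip loses that information because the file only contains text.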